Abstract

Object viewpoint estimation from 2D images is an 
essential task in computer vision. However, two issues 
hinder its progress: scarcity of training data with viewpoint 
annotations, and a lack of powerful features. Inspired 
by the growing availability of 3D models, we propose a 
framework to address both issues by combining render-based
image synthesis and CNNs. We believe that 3D
models have the potential to generate a large number
of images of high variation, which can be well exploited
by deep CNNs with a high learning capacity. Towards this
goal, we propose a scalable and overfit-resistant image 
synthesis pipeline, together with a novel CNN specifically 
tailored for the viewpoint estimation task. Experimentally,
we show that the viewpoint estimation from our pipeline
can significantly outperform state-of-the-art methods on
the PASCAL 3D+ benchmark.


1. Introduction 

3D recognition is a cornerstone problem in many 
vision applications and has been widely studied. Despite 
its critical importance, existing approaches are far from 
robust when applied to cluttered real-world images. We 
believe that two issues have to be addressed to enable 
more successful methods: scarcity of training images with 
accurate viewpoint annotation, and a lack of powerful 
features specifically tailored for 3D tasks. 

The first issue, scarcity of images with accurate viewpoint
annotation, is mostly due to the high cost of manual
annotation, and the associated inaccuracies due to human
error. Consequently, the largest 3D image dataset, PASCAL
3D+ [34], contains only ~22K images. As such, it is limited
in diversity and scale compared with object classification 
datasets such as ImageNet, which contains millions of 
images [6]. 
Figure 1. System overview. We synthesize training images by
overlaying images rendered from large 3D model collections on 
top of real images. A CNN is trained to map images to the ground 
truth object viewpoints. The training data is a combination of real 
images and synthesized images. The learned CNN is applied to 
estimate the viewpoints of objects in real images. 


The second issue is a lack of powerful features specifically
tailored for viewpoint estimation. Most 3D vision
systems rely on features such as SIFT and HoG, which were
designed primarily for classification and detection tasks.
However, this runs contrary to the recent finding that features
learned by task-specific supervision lead to much better
task performance [17, 12, 15]. Ideally, we want to learn
stronger features with deep CNNs. This, however, requires
a huge amount of viewpoint-annotated images.

In this paper, we propose to address both issues by combining
render-based image synthesis and CNNs, enabling us
to learn discriminative features. We believe that 3D models
have the potential to generate a large number of images of
high variation, which can be well exploited by deep CNNs
with a high learning capacity.

The inspiration comes from our key observation: more 
and more high-quality 3D CAD models are available online. 
In particular, many geometric properties, such as symmetry 
and joint alignment, can be efficiently and reliably estimated
by algorithms with limited human effort (Sec 2). By
rendering the 3D models, we convert the rich information
carried by them into 3D annotations automatically.

To explore the idea of “Render for CNN” for 3D tasks, 
we focus on the viewpoint estimation problem — for an 
input RGB image and a bounding box from an off-the-shelf 
detector, our goal is to estimate the viewpoint. 

To prepare training data for this task, we augment real 
images by synthesizing millions of highly diverse images. 
Several techniques are applied to increase the diversity of 
the synthesized dataset, in order to prevent the deep CNN 
from picking up unreliable patterns and push it to learn 
more robust features. 

To fully exploit this large-scale dataset, we design a deep 
CNN specifically tailored for the viewpoint estimation task. 
We formulate a class-dependent fine-grained viewpoint 
classification problem and solve the problem with a novel 
loss layer adapted for this task. 

The results are surprising: trained on a dataset containing
millions of rendered images, our CNN-based
viewpoint estimator significantly outperforms state-of-the-art
methods, tested on real images from the challenging
PASCAL 3D+ dataset.

In summary, our contributions are as follows: 

• We show that training CNNs with massive synthetic data
is an effective approach for 3D viewpoint estimation.
In particular, we achieve state-of-the-art performance
on a benchmark data set;

• Based upon existing 3D model repositories, we propose
a synthesis pipeline that generates millions of
images with accurate viewpoint labels at negligible
human cost. This pipeline is scalable, and the
generated data is resistant to overfitting by CNNs;

• Leveraging the large synthesized data set, we propose
a fine-grained view classification formulation, with a
loss function encouraging strong correlation of nearby
views. This formulation allows us to accurately predict
views and capture underlying viewpoint ambiguities.

2. Related Work 

3D Model Datasets Prior work has focused on manually 
collecting organized 3D model datasets (e.g., [8, 11]). 

Recently, several large-scale online 3D model repositories
have grown to tremendous sizes through public
aggregation, including the Trimble 3D Warehouse (above
2.5M models in total), Turbosquid (300K models) and
Yobi3D (1M models). Using data from these repositories,
[33] built a dataset of ~130K models from over 600
categories. More recently, ShapeNet [1] annotated ~330K
models from over 4K categories. Using geometric analysis
techniques, they semi-automatically aligned 57K models
from 55 categories by orientation.

3D Object Detection Most 3D object detection methods
are based on representing objects with discriminative features
for points [5], patches [7] and parts [19, 26, 30], or by
exploring topological structures [16, 3, 4]. More recently,
3D models have been used for supervised learning of 
appearance and geometric structure. For example, [29] and 
[20] proposed similar methods that learn a 3D deformable 
part model and demonstrate superior performance for 
cars and chairs, respectively; [21] and [2] formulated an 
alignment problem and built key point correspondences 
between 2D images and rendered 3D views. In contrast to 
these prior efforts that use hand-designed models based on 
hand-crafted features, we use a CNN to learn a viewpoint 
estimation system directly from data. 

Synthesizing Images for Training Recently, [29, 19, 
14, 24] used 3D models to render images for training 
object detectors and viewpoint classifiers. They tweak the 
rendering parameters to maximize model usage, since they 
have a limited number of 3D models - typically below 
50 models per category and insufficient to capture the 
geometric and appearance variance of objects in practice. 
Leveraging 3D repositories, [20, 2] use 250 and 1.3K chair
models respectively to render tens of thousands of training
images per model, which are then used to train deformable
part models (DPM [9]). In our work, we synthesize several
orders of magnitude more images than existing work. We 
also explore methods to increase data variation by changing 
background patterns, illumination, viewpoint, etc., which is 
critical for preventing overfitting of the CNN. 

While both [24] and our work connect synthetic images
with CNNs, they are fundamentally different in task,
approach and result. First, [24] focused on 2D object
detection, whereas our work focuses on 3D viewpoint
estimation. Second, [24] used a small set of synthetic
images (2,000 in total) to train linear classifiers based
on features extracted by out-of-the-box CNNs [17, 12].
In contrast, we develop a scalable synthesis pipeline and
generate around 6 million images to learn geometry-aware
features by training deep CNNs (initialized by [12]). Third,
the performance of [24], though better than previous work
using synthetic data, did not match the R-CNN baseline trained
on real images [12]. In contrast, we show significant
performance gains (Sec 5.2) over previous work [34] that uses
the full set of real data of PASCAL VOC 2012 (train set).

3. Problem Statement 

For an input RGB image, our goal is to estimate its
viewpoint. We parameterize the viewpoint as a tuple
$(\theta, \phi, \psi)$ of camera rotation parameters, where
$\theta$ is the azimuth, $\phi$ is the elevation, and $\psi$ is
the in-plane rotation.
They are discretized in a fine-grained manner, with azimuth,
elevation and in-plane rotation angles being divided into
360, 180 and 360 bins respectively. The viewpoint
estimation problem is formalized as classifying the camera
rotation parameters into these fine-grained bins (classes).
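As a minimal sketch of this discretization, the snippet below maps continuous angles to class indices with one-degree bins. The angle ranges (azimuth in [0, 360), elevation in [-90, 90], in-plane rotation in [-180, 180)) and the function name `discretize_viewpoint` are assumptions made for illustration; the text only fixes the number of bins.

```python
import math

# Bin counts stated in the text: 360 azimuth, 180 elevation, 360 in-plane.
AZIMUTH_BINS, ELEVATION_BINS, TILT_BINS = 360, 180, 360


def discretize_viewpoint(azimuth_deg, elevation_deg, tilt_deg):
    """Map continuous angles (in degrees) to fine-grained class indices.

    Assumed ranges (not specified in the text): azimuth in [0, 360),
    elevation in [-90, 90], in-plane rotation in [-180, 180).
    """
    # Azimuth wraps around, so reduce it modulo 360 before binning.
    az_bin = int(math.floor(azimuth_deg % 360.0)) % AZIMUTH_BINS
    # Elevation does not wrap; the top edge (+90) falls into the last bin.
    el_bin = min(int(math.floor(elevation_deg + 90.0)), ELEVATION_BINS - 1)
    # In-plane rotation wraps like azimuth after shifting to [0, 360).
    tilt_bin = int(math.floor((tilt_deg + 180.0) % 360.0)) % TILT_BINS
    return az_bin, el_bin, tilt_bin
```

With these assumed ranges, a frontal view at zero elevation and zero in-plane rotation maps to the class tuple (0, 90, 180).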



Figure 2. 3D model set augmentation by symmetry-preserving 
deformation. 


By adopting a fine-grained viewpoint classification 
formulation, our estimation is informative and accurate. 
Compared with regression-based formulations [22], our 
formulation returns the probabilities of each viewpoint, 
thus capturing the underlying viewpoint ambiguity possibly 
caused by symmetry or occlusion patterns. This information
can be useful for further processing. Compared with
traditional coarse-grained classification-based formulations
that typically have 8 to 24 discrete classes [26, 34], our
formulation is capable of producing much more fine-grained
viewpoint estimation.

4. Render for CNN System 

Since the space of viewpoints is discretized in a highly
fine-grained manner, massive training data is required for
the training of the network. We describe how we synthesize
such a large number of training images in Sec 4.1, and how
we design the network architecture and loss function for
training the CNN with the synthesized images in Sec 4.2.

4.1. Training Image Generation 

To generate training data, we augment real images by 
rendering 3D models. To increase the diversity of object 
geometry, we create new 3D models by deforming existing 
ones downloaded from a modestly-sized online 3D model 
repository. To increase the diversity of object appearance
and background clutter, we design a synthesis pipeline
that randomly samples rendering parameters and adds
random background patterns from scene images.


Structure-preserving 3D Model Set Augmentation We
take advantage of an online 3D model repository, ShapeNet,
to collect seed models for classes of interest. The provided
models are already aligned by orientation. For models that
are bilaterally or cylindrically symmetric, their symmetry planes
or axes are also already extracted. Please refer to [1] for
more details of ShapeNet.

From each of the seed models, we generate new models
by a structure-preserving deformation. The problem of
structure-preserving deformation has been widely studied
in the field of geometry processing, and there exist many
candidate models, as surveyed in [23]. We choose a
symmetry-preserving free-form deformation, defined via regularly
placed control points in the bounding cube, similar to
the approach of [27]. Our choice is largely due to the
model’s simplicity and efficiency. More advanced methods
can detect and preserve more structures, such as partial
symmetry and rigidity [28].

To generate a deformed model from a seed model,
we draw i.i.d. samples from a Gaussian distribution for
the translation vector of each control point. In addition,
we regularize the deformation by constraining the translations
of symmetric control points to be equal. Figure 2 shows
example deformed models from our method.
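The control-point sampling step can be sketched as follows. This is only an illustration under stated assumptions: the lattice size, the standard deviation, the choice of the plane perpendicular to the first lattice axis as the symmetry plane, and the function name `sample_symmetric_offsets` are all hypothetical. We also read "equal" as "equal up to reflection", so that mirrored control points receive mirror-image translations and the deformed shape stays bilaterally symmetric.

```python
import numpy as np


def sample_symmetric_offsets(grid_size=4, sigma=0.05, seed=None):
    """Sample Gaussian translations for a regular lattice of control points.

    Returns an array of shape (grid_size, grid_size, grid_size, 3) whose
    entries are mirror-symmetric about the plane perpendicular to axis 0.
    grid_size and sigma are illustrative values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # One i.i.d. Gaussian translation vector per control point.
    offsets = rng.normal(0.0, sigma, size=(grid_size,) * 3 + (3,))
    # Reflect the lattice along axis 0 and flip the x-component, which gives
    # the translation each control point's mirror partner would impose on it.
    mirror = np.array([-1.0, 1.0, 1.0])
    mirrored = offsets[::-1] * mirror
    # Averaging a field with its reflection yields a symmetric field
    # (at the cost of shrinking the standard deviation by sqrt(2)).
    return 0.5 * (offsets + mirrored)
```

A free-form deformation would then displace each mesh vertex by interpolating these control-point translations inside the bounding cube; that interpolation step is omitted here.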

Overfit-Resistant Image Synthesis We synthesize a
large number of images for each 3D model. Rather than
pursuing photorealism, we try to generate images of high
diversity, so that we prevent the deep CNN from picking up
unreliable patterns.

We inject randomness in the three basic steps of our 
pipeline: rendering, background synthesis, and cropping. 

For image rendering, we explore two sets of parameters:
lighting condition and camera configuration. For the lighting
condition, the number of light sources, their positions
and energies are all sampled. For the camera extrinsics,
we sample azimuth, elevation and in-plane rotation from a
distribution estimated from a training dataset. Refer to the
supplementary material for details.

Images rendered as above have a fully transparent background,
and the object boundaries are highly contrasted.
To prevent classifiers from overfitting to such unrealistic
boundary patterns, we synthesize the background by a
simple and scalable approach. For each rendered image,
we randomly sample an image from the SUN397 dataset [35].
We use alpha compositing to blend a rendered image as
foreground and a scene image as background.
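The blending step can be illustrated with the standard "over" compositing rule. This is a sketch, not the pipeline's actual code: it assumes the rendered image is an RGBA float array in [0, 1], that the scene image has already been cropped or resized to the same height and width, and the function name `alpha_composite` is hypothetical.

```python
import numpy as np


def alpha_composite(rendered_rgba, background_rgb):
    """Blend a rendered RGBA foreground over an RGB background.

    Both inputs are float arrays in [0, 1] with the same height and width
    (an assumption of this sketch). Returns an RGB array.
    """
    rendered = np.asarray(rendered_rgba, dtype=np.float64)
    background = np.asarray(background_rgb, dtype=np.float64)
    # Keep the alpha channel as (H, W, 1) so it broadcasts over RGB.
    alpha = rendered[..., 3:4]
    # Opaque pixels show the object, transparent pixels show the scene,
    # and partially transparent boundary pixels are mixed smoothly.
    return alpha * rendered[..., :3] + (1.0 - alpha) * background
```

Because boundary pixels of an anti-aliased rendering carry fractional alpha, this mixing softens the sharp object outline that the text identifies as an overfitting risk.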

To teach the CNN to recognize occluded or truncated
objects, we crop the image by a perturbed object bounding
box. The cropping parameters are also learned from the
training set. We find that the resulting crop patterns tend to
be natural. For example, the bottom parts of chairs are
cropped more often, since chair legs and seats are often occluded.
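A perturbed crop can be sketched as below. The text states that the cropping parameters are learned from the training set but does not give them, so this sketch substitutes a zero-mean Gaussian jitter proportional to the box size; the `jitter` value and the function name `perturbed_crop` are assumptions for illustration only.

```python
import numpy as np


def perturbed_crop(image, bbox, rng, jitter=0.1):
    """Crop `image` (H x W x C array) by a randomly perturbed bounding box.

    bbox is (x1, y1, x2, y2) in pixels. Each edge is shifted by Gaussian
    noise whose standard deviation is `jitter` times the box width/height
    (a placeholder for the learned cropping distribution).
    """
    height, width = image.shape[:2]
    x1, y1, x2, y2 = bbox
    box_w, box_h = x2 - x1, y2 - y1
    dx1, dx2 = rng.normal(0.0, jitter * box_w, size=2)
    dy1, dy2 = rng.normal(0.0, jitter * box_h, size=2)
    # Clip to the image and keep the crop at least one pixel wide and tall.
    nx1 = int(np.clip(round(float(x1 + dx1)), 0, width - 1))
    ny1 = int(np.clip(round(float(y1 + dy1)), 0, height - 1))
    nx2 = int(np.clip(round(float(x2 + dx2)), nx1 + 1, width))
    ny2 = int(np.clip(round(float(y2 + dy2)), ny1 + 1, height))
    return image[ny1:ny2, nx1:nx2]
```

Shifting an edge inward truncates the object, while shifting it outward lets background clutter into the crop, so both failure modes of a real detector are simulated.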

Finally, we put together the large amount of synthetic
images, together with a small amount of real images with
ground truth human annotations, to form our training image
set. The ground truth annotation of a sample $s$ is denoted
as $(c_s, v_s)$, where $c_s$ is the class label of the sample,
$v_s \in \mathcal{V}$ is the discretized viewpoint label tuple, and
$\mathcal{V}$ is the space of discretized viewpoints.

4.2. Network Architecture and Loss Function 

Class-Dependent Network Architecture. To effectively
exploit this large-scale dataset, we need a model with
sufficient learning capacity. CNNs are the natural choice
for this challenge. We adopt the network structure of [18]
as the starting point to design a novel architecture that fits
our viewpoint estimation task.

We found that a CNN trained for viewpoint estimation
of one class does not perform well on another class, possibly
due to the huge geometric variation between the classes.
Instead, the viewpoint estimation classifiers are trained in a
class-dependent way. However, a naive way of training the
class-dependent viewpoint classifiers, i.e., one network for
each class, cannot scale up, as the number of parameters of
the whole system increases linearly with the number of classes.

To address this issue, we propose a novel network
architecture where the lower layers (both convolutional
layers and fully connected layers) are shared by all classes,
while the class-dependent layers are stacked over them (see
Figure 4). Our network architecture design accommodates
the fact that viewpoint estimation is class-dependent while
maximizing the usage of the low-level features shared across
different classes, which keeps the overall number of network
parameters tractable. We initialize the shared convolutional
and fully connected layers with the weights from [13].
During training, all the shared convolutional and fully
connected layers are fine-tuned, while the class-dependent
fully connected layers are trained from scratch.
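The parameter-sharing idea can be illustrated with a toy model. This is not the actual network: a single shared linear layer with ReLU stands in for the whole shared convolutional and fully connected stack, each class gets one linear soft-max head over 360 views, and the class name `SharedTrunkViewNet` and all dimensions are hypothetical.

```python
import numpy as np


class SharedTrunkViewNet:
    """Toy stand-in: one shared layer plus one viewpoint head per class."""

    def __init__(self, in_dim, feat_dim, num_classes, num_views=360, seed=0):
        rng = np.random.default_rng(seed)
        # Shared by all classes (would be initialized from a pretrained net).
        self.w_shared = rng.normal(0.0, 0.01, (in_dim, feat_dim))
        # One class-dependent head per object class (trained from scratch).
        self.w_heads = rng.normal(0.0, 0.01, (num_classes, feat_dim, num_views))

    def forward(self, x, class_id):
        """Return soft-max probabilities over views for a known class."""
        feat = np.maximum(x @ self.w_shared, 0.0)  # shared features + ReLU
        logits = feat @ self.w_heads[class_id]     # pick the head of class c_s
        logits = logits - logits.max()             # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()

    def num_parameters(self):
        # Only the head term grows with the number of classes.
        return self.w_shared.size + self.w_heads.size
```

In this toy setting, adding a class costs only `feat_dim * num_views` extra parameters, whereas training one full network per class would duplicate the shared part as well.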

Geometric Structure Aware Loss Function. The outputs
of the network, the $(\theta, \phi, \psi)$ tuples, are geometric
entities. We propose a geometric structure aware loss
function to exploit their geometric constraints. We define
the viewpoint classification loss $L_{vp}$, adapted from the
soft-max loss, as:

$$L_{vp}(\{s\}) = -\sum_{\{s\}} \sum_{v \in \mathcal{V}} e^{-d(v, v_s)/\sigma} \log P_v(s; c_s), \tag{1}$$

where $P_v(s; c_s)$ is the probability of view $v$ for sample
$s$ from the soft-max viewpoint classifier of class $c_s$, and
$d: \mathcal{V} \times \mathcal{V} \mapsto \mathbb{R}$ is the distance between two viewpoints,
defined to be the geodesic distance of points by $(\theta, \phi)$
on a 2-sphere plus the $\ell_1$ distance of $\psi$. By substituting
an exponential decay weight w.r.t. viewpoint distance for
the mis-classification indicator weight in the original
soft-max loss, we explicitly encourage correlation among the
viewpoint predictions of nearby views.

Figure 4. Network architecture¹. Our network architecture
design accommodates the fact that viewpoint estimation is
class-dependent, while maximizing the usage of the low-level
features shared across different classes to keep the overall
number of network parameters tractable.

¹For simplicity, Pooling, Dropout, and ReLU layers are not shown. See
the supplementary material for the full network definition.
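To make the weighting in Eq. (1) concrete, the sketch below evaluates the loss for a single sample, simplified to the azimuth angle alone. That simplification is an assumption of this example: the distance is the circular bin distance rather than the full geodesic-plus-$\ell_1$ distance, and the function names are hypothetical.

```python
import numpy as np


def circular_bin_distance(num_bins, gt_bin):
    """Distance (in bins) from every bin to gt_bin on a circle."""
    diff = np.abs(np.arange(num_bins) - gt_bin)
    return np.minimum(diff, num_bins - diff)


def geometric_aware_loss(logits, gt_bin, sigma=1.0):
    """Soft-max loss with exponentially decaying weights around gt_bin.

    logits: 1-D array of unnormalized scores, one per azimuth bin.
    Every bin contributes -w_v * log P_v with w_v = exp(-d(v, v_s) / sigma).
    """
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()                      # stable log-soft-max
    log_probs = shifted - np.log(np.exp(shifted).sum())
    weights = np.exp(-circular_bin_distance(logits.size, gt_bin) / sigma)
    return float(-(weights * log_probs).sum())
```

As `sigma` approaches zero the weights collapse onto the ground-truth bin and the expression reduces to the ordinary soft-max loss; larger values spread credit to neighboring views.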

5. Experiments 

Our experiments are divided into four parts. First,
we evaluate our viewpoint estimation system on the
PASCAL 3D+ data set [34] (Sec 5.2). Second, we visualize
the structure of the learned viewpoint-discriminative feature
space (Sec 5.3). Third, we perform control experiments to
study the effects of synthesis parameters (Sec 5.4). Last,
we show more qualitative results and analyze error patterns
(Sec 5.5). Before we discuss experiment details, we first
give an overview of the 3D model set used in all the experiments.

5.1. 3D Model Dataset 

As we discussed in Sec 2, there are several large-scale
3D model repositories online. We download 3D models
from ShapeNet [1], which has organized common daily objects
with categorization labels and joint alignment. Since
we evaluate our method on the PASCAL 3D+ benchmark,
we download 3D models belonging to the 12 categories
of PASCAL 3D+, including 30K models in total. After
symmetry-preserving model set augmentation (Sec 4.1), we
make sure that every category has 10K models. For more
details, please refer to the supplementary material.









VOC 2012 val AVP      aero  bicycle  boat  bus   car   chair  table  mbike  sofa  train  tv    Avg.
VDPM-4V [34]          34.6  41.7     1.5   26.1  20.2  6.8    3.1    30.4   5.1   10.7   34.7  19.5
VDPM-8V               23.4  36.5     1.0   35.5  23.5  5.8    3.6    25.1   12.5  10.9   27.4  18.7
VDPM-16V              15.4  18.4     0.5   46.9  18.1  6.0    2.2    16.1   10.0  22.1   16.3  15.6
VDPM-24V              8.0   14.3     0.3   39.2  13.7  4.4    3.6    10.1   8.2   20.0   11.2  12.1
DPM-VOC+VP-4V [10]    37.4  43.9     0.3   48.6  36.9  6.1    2.1    31.8   11.8  11.1   32.2  23.8
DPM-VOC+VP-8V         28.6  40.3     0.2   38.0  36.6  9.4    2.6    32.0   11.0  9.8    28.6  21.5
DPM-VOC+VP-16V        15.9  22.9     0.3   49.0  29.6  6.1    2.3    16.7   7.1   20.2   19.9  17.3
DPM-VOC+VP-24V        9.7   16.7     2.2   42.1  24.6  4.2    2.1    10.5   4.1   20.7   12.9  13.6
Ours-Joint-4V         54.0  50.5     15.1  57.1  41.8  15.7   18.6   50.8   28.4  46.1   58.2  39.7
Ours-Joint-8V         44.5  41.1     10.1  48.0  36.6  13.7   15.1   39.9   26.8  39.1   46.5  32.9
Ours-Joint-16V        27.5  25.8     6.5   45.8  29.7  8.5    12.0   31.4   17.7  29.7   31.4  24.2
Ours-Joint-24V        21.5  22.0     4.1   38.6  25.5  7.4    11.0   24.4   15.0  28.0   19.8  19.8


Table 1. Simultaneous object detection and viewpoint estimation on PASCAL 3D+. The measurement is AVP (an extension of AP,
where a detection counts as a true positive only when bounding box localization AND viewpoint estimation are both correct). We show AVPs for four
quantization cases of 360-degree views (into 4, 8, 16, 24 bins respectively, with increasing difficulty). Our method uses joint real and
rendered images and trains a CNN tailored for this task.


5.2. Comparison with state-of-the-art Methods 

We compare with state-of-the-art methods on the PASCAL
3D+ benchmark.

Methods in Comparison We compare with two baseline
methods, VDPM [34] and DPM-VOC+VP [25], trained on
real images from the PASCAL 3D+ VOC 2012 train set and
tested on VOC 2012 val.

For our method, we train on a combination of real images
and synthetic images. We synthesized 20 images per model,
which adds up to 200K images per category and in total
2.4M images for all 12 classes. In our loss function (Eq (1)),
we set $\sigma = 1$, chosen by holding out 30% of the data for
validation.

Joint Detection and Viewpoint Estimation Following
the protocol of [34, 25], we test on the joint detection and
viewpoint estimation task. The bounding boxes of baseline
methods are from their detectors and ours are from R-CNN
with bounding box regression [13]. The accuracy of the R-CNN
detectors is shown in Table 2.



Figure 5. Simultaneous object detection and viewpoint estimation
performance. We compare mAVP of our models and the
state-of-the-art methods. We also compare Ours-Real with
Ours-Render and Ours-Joint (which uses both real and rendered
images for training) to see how much rendered images can help.


We use AVP (Average Viewpoint Precision) advocated
by [34] as the evaluation metric. AVP is the average precision
with a modified true positive definition, requiring both 2D
detection AND viewpoint estimation to be correct.
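The modified true positive test can be sketched as follows. The 50% overlap threshold matches the one quoted for correct detections in this section; how the benchmark aligns view bins to angles is not stated here, so bins that start at 0° are an assumption, as are the function names.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def azimuth_to_view(azimuth_deg, num_views):
    """Quantize an azimuth into one of num_views equal bins starting at 0."""
    bin_width = 360.0 / num_views
    return int((azimuth_deg % 360.0) // bin_width) % num_views


def is_avp_true_positive(pred_box, gt_box, pred_az, gt_az, num_views,
                         iou_thresh=0.5):
    """A detection counts only if the box AND the view bin are both right."""
    localized = box_iou(pred_box, gt_box) > iou_thresh
    same_view = azimuth_to_view(pred_az, num_views) == azimuth_to_view(gt_az, num_views)
    return localized and same_view
```

The same detection can therefore be a true positive under the 4-view quantization and a false positive under the 8-view one, which is why AVP drops as the discretization gets finer.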

Table 1 and Figure 5 summarize the results. We observe
that our method, trained with a combination of real images
and rendered images, outperforms the baselines
by a large margin, from a coarse viewpoint discretization
(4V) to a fine one (24V), in all object categories.

Viewpoint Estimation One might argue that we achieve
higher AVP due to the fact that R-CNN has a higher
2D detection performance. We therefore also directly compare
viewpoint estimation performance using the same bounding
boxes. We do two groups of comparisons on viewpoint
estimation. The first group studies the accuracy on
detection bounding boxes; the second group studies the
accuracy on ground truth bounding boxes.

We first show the comparison results with VDPM, using
the bounding boxes from R-CNN detection. For the sets
from detection, only correctly detected bounding boxes
are used (50% overlap threshold). The evaluation metric
is a continuous version of viewpoint estimation accuracy,
i.e., the percentage of bounding boxes whose prediction is
within $\theta$ degrees of the ground truth.
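A minimal sketch of this continuous metric, together with the median error reported alongside it, is given below. It assumes azimuths in degrees and measures error as the shorter way around the circle; the function names are hypothetical.

```python
import numpy as np


def azimuth_error(pred_deg, gt_deg):
    """Absolute angular difference in degrees, taking wrap-around into account."""
    diff = np.abs(np.asarray(pred_deg, dtype=np.float64)
                  - np.asarray(gt_deg, dtype=np.float64)) % 360.0
    return np.minimum(diff, 360.0 - diff)


def accuracy_within(pred_deg, gt_deg, threshold_deg):
    """Fraction of boxes whose azimuth error is below threshold_deg."""
    return float(np.mean(azimuth_error(pred_deg, gt_deg) < threshold_deg))


def median_error(pred_deg, gt_deg):
    """Median azimuth error in degrees (lower is better)."""
    return float(np.median(azimuth_error(pred_deg, gt_deg)))
```

Sweeping `threshold_deg` over a range of values traces out an accuracy curve of the kind plotted against the angle error.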

Figure 6 summarizes the results. Again, our method is
significantly better than VDPM on all sets. In particular, the
median viewpoint estimation error of our method is
14°, compared with 57° for VDPM.

Next we show the performance comparison using ground
truth bounding boxes (Table 3). We compare with the recent
work of [31], which uses a network architecture (TNet)
similar to ours except for the loss layer. Note that the viewpoint
estimation in this experiment includes azimuth, elevation
and in-plane rotation. We use the same metric as in [31].
For the details of the metric definition, please refer to [31].


































      aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv    mean
AP    74.0  66.7  32.9  31.5    68.0  58.4  26.9   39.3   71.5   44.2  63.1   63.7  54.5


Table 2. Average Precision (AP) on VOC 12 val with R-CNN with bounding box regression. 



Figure 6. Viewpoint estimation performance on detected object
windows on VOC 2012 val. Left: mean viewpoint estimation
accuracy as a function of azimuth angle error $\delta_\theta$. Viewpoint is
correct if the distance in degrees between prediction and ground truth
is less than $\delta_\theta$. Right: medians of azimuth estimation errors (in
degrees), lower is better.


From the results, it is clear that our method significantly
outperforms the baseline CNN.

To evaluate the effect of synthetic data versus real
data, we also compare a model trained with real images
(Ours-Real) and a model trained with rendered images
(Ours-Render). For Ours-Real-vanilla and Ours-Real, we flip all
VOC 12 train set images to augment the dataset, and for
Ours-Real we use the geometry-aware loss (Eq. 1). For
Ours-Render, we only use synthetic images for training. In
Figure 6, we see a 32% decrease in median azimuth error from
Ours-Real (23.5°) to Ours-Render (16°). By combining the
two data sources (we simultaneously feed the network with
real and rendered images but assign them different
weights), we reduce the error by another 2°.

Furthermore, to show the benefits of having a fine-grained
viewpoint estimation formulation, we take the top-2
viewpoint proposals with the highest confidences in a local
area. Figure 6 left shows that having top-2 proposals
significantly improves mVP when the azimuth angle error is large
(around 15% improvement compared with the top-1 method
Ours-Joint). The top-2 improvement can be understood by
observing ambiguous cases in Figure 8, where the CNN gives
two or more high-probability proposals and, in many cases,
one of them is correct.













Figure 7. Feature space visualization for R-CNN and our
view classification CNN. We visualize features of cropped car
images extracted by R-CNN (left) and our CNN (right) by t-SNE
dimension reduction. Each feature point of an image is marked
by a color corresponding to the cluster defined by its quantized
azimuth angle (8 bins for $[0, 2\pi)$). For each cluster, its center is
labeled on the plot by the id.


5.3. Learned Feature Space Visualization

The viewpoint estimation problem has its intrinsic difficulty,
due to factors such as object symmetry and similarity
of object appearance at nearby views. Since our CNN
can predict viewpoints well, we expect the structure of our
CNN feature space to reflect this nature. In Figure 7, we
visualize the feature space of our CNN² in 2D by dimension
reduction [32], using “car” as an example. As a comparison,
we also show the feature space from R-CNN over the same
set of images. Interestingly, we observe that viewpoint-related
patterns in our feature space are much stronger than in the
R-CNN feature space: 1) images from similar views are
clustered together; 2) images of symmetric views (such
as 0° vs 180°) tend to be closer; 3) the features form a
double loop. In addition, as the feature point moves in
the clockwise direction, the viewpoint also moves clockwise
around the car. Such observations exactly reflect the
nature we discussed at the beginning of this paragraph. In
comparison, there is no obvious viewpoint pattern in the
R-CNN feature space.

²The output of the last fully connected layer.

5.4. Synthesis Parameter Analysis

In this section we show the results of our control experiments,
which analyze the importance of different factors
in our image synthesis pipeline. The control experiments
focus on the chair category, since it is challenging due to its
diversity of structure. We first introduce the five testbed
data sets and the evaluation metrics.


                                     aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv    mean
$Acc_{\pi/6}$ (Tulsiani, Malik)      0.78  0.74  0.49  0.93    0.94  0.90  0.65   0.67   0.83   0.67  0.79   0.76  0.76
$Acc_{\pi/6}$ (Ours-Render)          0.74  0.83  0.52  0.91    0.91  0.88  0.86   0.73   0.78   0.90  0.86   0.92  0.82
MedErr (Tulsiani, Malik)             14.7  18.6  31.2  13.5    6.3   8.8   17.7   17.4   17.6   15.1  8.9    17.8  15.6
MedErr (Ours-Render)                 15.4  14.8  25.6  9.3     3.6   6.0   9.7    10.8   16.7   9.5   6.1    12.6  11.7

Table 3. Viewpoint estimation with ground truth bounding box. Evaluation metrics are defined in [31], where $Acc_{\pi/6}$ measures accuracy (the
higher the better) and MedErr measures error (the lower the better). The model from Tulsiani, Malik [31] is based on TNet, a network
architecture similar to ours except for the loss layer. While they use real images from both VOC 12 val and ImageNet for training, Ours-Render only
uses rendered images for training.

Experimental Setup We refer to the test datasets using
the following short names: 1) clean: 1026 images from the
web, with relatively clean backgrounds but no occlusion,
e.g., product photos in outdoor scenes. 2) cluttered: 1000
images from the web, with heavy clutter in the background
but no occlusion. 3) ikea: 200 images of chairs photographed
in an IKEA department store, with strong background
clutter but no occlusion. 4) VOC-easy: 247 chair images
from PASCAL VOC 12 val, with no occlusion, no truncation,
and not marked difficult. 5) VOC-all: all 1449 chair
images from PASCAL VOC 12 val. While the clean
and cluttered sets exhibit a strongly non-uniform viewpoint
distribution, the VOC-easy and VOC-all sets show a
similar but weaker bias. The ikea dataset
has a close-to-uniform viewpoint distribution. All images are
cropped by ground truth bounding boxes. The ground truth
for the clean, cluttered and ikea datasets is provided by the
authors; that for VOC-easy and VOC-all is from PASCAL
VOC 12. We use all five datasets, instead of just one
or two of them, to make sure that our conclusions are not
affected by dataset bias.

Unless otherwise noted, our evaluation metric is a
discrete viewpoint accuracy with “tolerance”, denoted as
$16V_{tol}$. Specifically, we collect 16-class viewpoint
annotations (each of which corresponds to a 22.5° slot)
for all the data sets we described above. At test time,
if the predicted angle is within the label slot or off by
one slot (tolerance), we count it as correct. The tolerance
is necessary since labels may not be accurate in our
small-scale human experiments for 16-class viewpoint
annotation³.
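The tolerance rule can be written down directly. In this sketch the 16 slots are assumed to start at 0° (the text does not say where slot boundaries lie), adjacency wraps around the circle so that slots 15 and 0 are neighbors, and the function names are hypothetical.

```python
def view_slot(azimuth_deg, num_slots=16):
    """Index of the 360/num_slots degree slot containing azimuth_deg."""
    return int((azimuth_deg % 360.0) // (360.0 / num_slots)) % num_slots


def accuracy_16v_tol(pred_azimuths_deg, label_slots, num_slots=16):
    """Fraction of predictions in the labeled slot or an adjacent one."""
    correct = 0
    for azimuth, label in zip(pred_azimuths_deg, label_slots):
        offset = (view_slot(azimuth, num_slots) - label) % num_slots
        # offset 0: same slot; 1 or num_slots - 1: neighboring slot.
        if offset in (0, 1, num_slots - 1):
            correct += 1
    return correct / len(label_slots)
```

With 22.5° slots, a prediction can be up to 45° away from the far edge of the labeled slot and still count as correct, which is the slack the tolerance is meant to give noisy human labels.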

Effect of Synthetic Image Quantity We separately fine-tune
multiple CNNs with different volumes of rendered
training images. We observe that the accuracy keeps
increasing as the training set volume grows (Table 4). The
observation confirms that more training data from synthesis
does help the training of the CNN, and that the potential of
3D models to render large numbers of images is useful.

Effect of Model Collection Size We keep the total
number of rendered training images fixed at 6928 × 128 =
886,784 and change the number of 3D models used for
training data synthesis. In Table 6 we see that as the model
collection size increases, system performance continually
increases.

³Accurate continuous viewpoint labels on PASCAL 3D+ are obtained
by a highly expensive approach of matching key points between images
and 3D models. We do not adopt that approach due to its complexity.
Instead, we simply ask the annotator to compare with reference images.


images per model  clean  clutter  ikea  VOC-easy  VOC-all  avg.
16                89.1   92.2     92.9  77.7      46.9     79.8
32                93.4   93.5     95.9  81.8      48.8     82.7
64                94.2   94.1     95.9  84.6      48.7     83.5
128               94.2   95.0     96.9  85.0      50.0     84.2

Table 4. Effect of synthetic image quantity. Numbers are $16V_{tol}$
(16 view accuracy with tolerance). A prediction is deemed correct
if it is in the same slot as the ground truth label or in an
adjacent viewpoint slot.



       clean  clutter  ikea  VOC-easy  VOC-all  avg.
nobkg  95.4   93.1     86.2  78.1      48.5     80.3
bkg    94.2   95.0     96.9  85.0      50.0     84.2

Table 5. Effect of background synthesis. Numbers are $16V_{tol}$
(16 view accuracy with tolerance).


Effect of Background Clutter As objects in the real
world are often observed in cluttered scenes, we expect
the network to perform better when trained on images
with synthetic backgrounds. To evaluate our hypothesis,
we design two experiments for comparison. In Table 5,
we can see that the nobkg group (trained on rendered images
with no background, i.e., black background) performs
worse than the bkg group (trained on rendered images with
a synthetic background, i.e., cropped images from a scene
database), especially on the ikea, VOC-easy and VOC-all
data sets, which are more similar to daily scenes with lots of
clutter. We also notice that the nobkg group performs better
on the clean data set. This is reasonable since the nobkg
network has been trained exclusively on clean-background
cases.


num models   clean   clutter   ikea   VOC-easy   VOC-all   avg.
91           87.4    84.9      89.8   74.9       44.9      76.4
1000         92.7    92.6      94.9   83.0       49.0      82.4
6928         94.2    95.0      96.9   85.0       50.0      84.2

Table 6. Effect of 3D model collection size. Numbers are 16V_tol 
(16 view accuracy with tolerance). The 91 models are cluster 
centers of a K-means clustering of the 6928 models. The 1000 
models are randomly chosen. 






































Figure 8. Viewpoint estimation example results. The bar under each image indicates the 360-class confidences (black means high 
confidence) corresponding to 0° to 360° (with the object facing towards us as 0° and rotating clockwise). The red vertical bar indicates the 
ground truth. The upper half shows positive cases; the lower half shows negative cases (with a red box surrounding each image). 


5.5. Qualitative Results 

Besides azimuth estimation, our system can also estimate 
the elevation and in-plane rotation of the camera. To 
visualize this ability, Figure 9 shows examples of model 
insertion for objects detected by R-CNN. The inserted 3D 
models are retrieved from our library by similarity. For a 
detailed quantitative evaluation of elevation and in-plane 
rotation, please refer to our supplementary material. 

Figure 9. 3D model insertion. 3D viewpoint recovery is essential 
for 3D recognition. Here we demonstrate that the recovered 
viewpoint can be used to narrow down the search space of 
model retrieval, which enables 3D model insertion into 2D images. 

Figure 8 shows more representative examples of our 
system. For each example, we show the image cropped by 
a bounding box and the confidence of all 360 views. Since 
the viewpoint classifiers are regularized by our 
geometry-aware loss and share lower layers, the network 
learns correlations among viewpoints. We observe two 
interesting patterns. First, for simple cases, our system 
usually outputs a single clear peak at the correct 
viewpoint. Second, for challenging cases, even when our 
system fails, there is usually still a lower peak around 
the groundtruth angle, as validated both by the examples 
and by the experimental results presented in Figure 6. 
Higher-level systems (e.g. 3D model alignment, keypoint 
detectors) can therefore use these peaks as proposals to 
reduce the search space and increase accuracy. This 
proposal ability is not available in a regression system. 

We observe several typical error patterns in our results: 
occlusion, multiple objects, truncation, and ambiguous 
viewpoint. Figure 8 illustrates these patterns with 
examples. In cases of occlusion, the system is sometimes 
confused and the 360-class probability plot looks noisy 
(no clear peaks). In cases of ambiguous viewpoints, there 
are usually two peaks of high confidence, corresponding to 
the two ambiguous viewpoints (e.g. a car facing towards or 
away from the viewer). In cases of multiple objects, the 
system often shows peaks corresponding to the viewpoints 
of each of those objects, which is a reasonable result. 

6. Conclusion 

We demonstrated that images rendered from 3D models 
can be used to train a CNN for viewpoint estimation on 
real images. Our synthesis approach can leverage large 
3D model collections to generate large-scale training data 
with fully annotated viewpoint information. Critically, we 
achieve this with negligible human effort, in stark 
contrast to previous efforts in which training datasets 
had to be manually annotated. 

We showed that by carefully designing the data synthesis 
process our method can significantly outperform existing 
methods on the task of viewpoint estimation on 12 object 
classes from PASCAL 3D+. We conducted extensive 
experiments to analyze the effect of the synthesis 
parameters and the input dataset scale on the performance 
of our system. 

In general, we envision Render for CNN as a promising 
direction, as it not only enables efficient training but 
also opens up the potential for highly controlled 
experiments, which might lead to a deeper understanding 
of CNNs. 


Appendix 

A. Organization of Appendix 

This document provides additional quantitative results, 
technical details, and example visualizations to the main 
paper. 

Here we describe the document organization. We start 
by providing more quantitative results, including a further 
evaluation of azimuth estimation (Sec B), as well as a 
quantitative evaluation of elevation and in-plane rotation 
estimation (Sec C). We then provide more technical details 
for the synthesis pipeline (Sec D) and network architecture 
(Sec E). In Sec F, we provide more details of our 3D 
model dataset. Lastly, we show more example visualizations 
(Sec G). 


B. Comparison over VDPM for Viewpoint 
Estimation by VDPM Bounding Box (Sec 
5.2) 

In Figure 5 of the main paper, we compare the viewpoint 
estimation performance of our methods versus VDPM, 
using detection windows from R-CNN. Here, we compare 
these methods again, using detection windows from VDPM, 
i.e., we estimate the viewpoint of objects in detection 
windows from VDPM. To make this material more 
self-contained, we summarize the settings of all methods in 
Table 7. 

Results are shown in Figure 10. The trend is unchanged, 
except that in Figure 10 VDPM (16V) is slightly better 
than Real-vanilla (the model trained with real images, 
without our new loss function). Note that R-CNN detects 
many more difficult cases than VDPM in terms of occlusion 
and truncation. In other words, Figure 10 focuses on the 
comparison of our methods and VDPM over simple cases. 


Figure 10. Viewpoint estimation performance using positive 
VDPM detection windows on PASCAL VOC 2012 val. Left: 
mean viewpoint estimation accuracy as a function of azimuth 
angle error δθ. A viewpoint is correct if the distance in degrees 
between prediction and groundtruth is less than δθ. Right: medians 
of azimuth estimation errors (in degrees), the lower the better. 
Refer to Table 7 for the settings of the methods in comparison. 


C. Quantitative Results on Elevation and In-plane 
Rotation (Sec 5.5) 

In Figure 11, we show results on elevation and in-plane 
rotation estimation. Since most objects tend to have small 
elevation and in-plane rotation variations, the range of 
these two parameters is smaller than that of the azimuth 
angle, and the viewpoint estimation accuracy is therefore 
higher. 
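
As an illustration of how a tolerance-based accuracy and a 
median error can be computed for angles, here is a minimal 
sketch. It assumes angles in degrees and a wrap-around at 
360°; the function names are hypothetical and this is not 
the authors' evaluation script. 

```python
import statistics


def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(pred_deg - true_deg) % 360.0
    return min(d, 360.0 - d)


def accuracy_at(preds, truths, tol_deg):
    """Fraction of predictions whose angular error is below tol_deg."""
    errors = [angular_error(p, t) for p, t in zip(preds, truths)]
    return sum(e < tol_deg for e in errors) / len(errors)


def median_error(preds, truths):
    """Median angular error in degrees (lower is better)."""
    return statistics.median(angular_error(p, t) for p, t in zip(preds, truths))
```

Sweeping tol_deg over a range of values yields an 
accuracy-versus-tolerance curve of the kind plotted in the 
figures. 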
















name              features learned   trained with   trained with     geometric structure
                  from data          real data      synthetic data   aware loss function
VDPM              no (HoG)           yes            no               no (16 DPMs)
Real-vanilla      yes                yes            no               no
Ours-Real         yes                yes            no               yes
Ours-Render       yes                no             yes              yes
Ours-Joint        yes                yes            yes              yes
Ours-Joint-Top2   yes                yes            yes              yes


Table 7. Summary of settings for methods in Figure 5 and Figure 6 of the main paper, and Figure 10 of the appendix. Note that the 
Ours-Joint-Top2 takes the top-2 viewpoint proposals with the highest confidences after non-maximum suppression. 
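
The top-2 selection with non-maximum suppression mentioned in 
the caption could look roughly like the sketch below. The 
suppression radius (suppress_bins) and the function name are 
assumptions made for illustration; the source does not state 
the window that was used. 

```python
def top_k_proposals(conf, k=2, suppress_bins=15):
    """Pick up to k peaks from a circular confidence vector by greedy NMS.

    conf: one confidence per viewpoint class (e.g. 360 azimuth classes).
    suppress_bins: hypothetical radius; bins this close to a kept peak are skipped.
    """
    n = len(conf)
    order = sorted(range(n), key=lambda i: conf[i], reverse=True)
    kept = []
    for i in order:
        # Circular distance (in bins) to every peak kept so far.
        if all(min(abs(i - j), n - abs(i - j)) > suppress_bins for j in kept):
            kept.append(i)
            if len(kept) == k:
                break
    return kept
```

Measuring distance circularly prevents a peak near 359° and a 
peak near 0° from being reported as two separate proposals. 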






Figure 12. Scalable and overfit-resistant synthesis pipeline. 




Figure 11. Viewpoint estimation precision on elevation and 
in-plane rotation. We show test results on VOC val in PASCAL 3D+. 
Our model here is trained with rendered images only. Left: 
Viewpoint precision for elevation. Right: Viewpoint precision for 
in-plane rotation. 


D. Synthesis Pipeline Details (Sec 4.1) 

The synthesis pipeline is illustrated in Figure 12. 


3D Model Normalization Following the convention of 
PASCAL 3D+ annotation, all 3D models are normalized to 
have the bounding cube centered at the origin and diagonal 
length 1. 
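
A minimal sketch of this normalization is given below. It 
assumes that the "bounding cube" can be approximated by the 
axis-aligned bounding box of the mesh vertices; the function 
name normalize_vertices is hypothetical. 

```python
import math


def normalize_vertices(vertices):
    """Center the axis-aligned bounding box at the origin and scale its diagonal to 1.

    vertices: list of (x, y, z) tuples.
    Assumption: the bounding cube is taken to be the axis-aligned bounding box.
    """
    mins = [min(v[a] for v in vertices) for a in range(3)]
    maxs = [max(v[a] for v in vertices) for a in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    diagonal = math.dist(mins, maxs)
    if diagonal == 0:
        raise ValueError("degenerate model: all vertices coincide")
    return [tuple((v[a] - center[a]) / diagonal for a in range(3)) for v in vertices]
```

After this step every model occupies a comparable volume, so 
a single camera distance distribution can be shared across 
models of the same category. 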


Rendering The parameters that affect the rendering 
include the lighting condition and the camera extrinsics 
and intrinsics. To render each image, we sample a set of 
these parameters from the following distributions: 


• Lighting condition. We add N point lights and enable 
the environmental light. N is uniformly sampled from 
1 to 10. All lighting parameters are sampled i.i.d. The 
position p_light is uniformly sampled on a sphere of 
radius 14.14, between latitude 0° and 60°; the energy 
E ~ N(4, 3); the color is fixed to be white. 

• Camera extrinsics. We first describe the camera 
position parameters (i.e., translation parameters). In 
the polar coordinate system, let ρ be the distance of the 
optical center to the origin, and let θ and φ be the 
longitude and latitude, respectively. We use kernel 
density estimation (KDE) to estimate the non-parametric 
distributions of ρ, θ, φ for each category respectively, 
from the VOC 12 train set of the PASCAL 3D+ dataset. We 
then use the estimated distributions to generate i.i.d. 
samples for rendering. 

We then describe the camera pose parameters (i.e., 
rotation parameters). Similar to the position parameters, 
we use KDE to estimate the distribution of in-plane 
rotation for each category and generate i.i.d. samples 
for rendering. We set the camera to look at the origin, 
with the image plane perpendicular to the ray from the 
optical center to the origin. 

• Camera intrinsics. The focal length and aspect ratio 
are fixed to be 35 and 1.0, respectively. 
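
The sampling described in the list above can be sketched as 
follows. This is an illustration under stated assumptions, 
not the rendering script itself: the z axis is taken as up, 
"uniformly sampled on a sphere" is read as area-uniform, the 
second parameter of N(4, 3) is read as a standard deviation, 
and all function names are hypothetical. 

```python
import math
import random


def sample_point_light(rng, radius=14.14, max_lat_deg=60.0):
    """Sample one white point light on the sphere band between latitude 0 and 60 degrees."""
    lon = rng.uniform(0.0, 2.0 * math.pi)
    # Area-uniform on the band: sin(latitude) is uniformly distributed.
    sin_lat = rng.uniform(0.0, math.sin(math.radians(max_lat_deg)))
    cos_lat = math.sqrt(1.0 - sin_lat ** 2)
    position = (radius * cos_lat * math.cos(lon),
                radius * cos_lat * math.sin(lon),
                radius * sin_lat)
    # Assumption: 3 is a standard deviation (it could also denote a variance).
    energy = rng.gauss(4.0, 3.0)
    return {"position": position, "energy": energy, "color": (1.0, 1.0, 1.0)}


def sample_lights(rng):
    """Sample N point lights with N uniform in 1..10 (both ends inclusive)."""
    return [sample_point_light(rng) for _ in range(rng.randint(1, 10))]


def camera_position(rho, lon_deg, lat_deg):
    """Convert distance, longitude and latitude to Cartesian coordinates (z up)."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (rho * math.cos(lat) * math.cos(lon),
            rho * math.cos(lat) * math.sin(lon),
            rho * math.sin(lat))
```

A Gaussian energy can occasionally come out negative; how 
such samples were handled is not stated in the text, so the 
sketch leaves them untouched. 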









































We use Blender, an open-source 3D graphics and 
animation software, to efficiently render each 3D model. 

Background synthesis All details are described in the 
main paper. 



Figure 13. Cropping parameter estimation. We estimate the 
cropping parameters by comparing the groundtruth bounding box 
and full object bounding box. The groundtruth bounding box 
(green) is from PASCAL 3D+. The full object bounding box 
(red) is estimated by us as follows: because the real training 
image dataset (PASCAL 3D+) has provided landmark registrations 
between each object instance and a similar 3D model, we can 
project the 3D model to the image space and estimate the full 
bounding box. 

Cropping As seen in Figure 13, we use annotations 
provided by PASCAL 3D+ to obtain truncation patterns of 
objects in real images. For each real training image, 
we project the 3D model corresponding to the object back 
onto the image and compute a bounding box for the projected 
model (the full object bounding box). By comparing it 
with the provided groundtruth bounding box of the object, 
we know how the object is truncated. Specifically, we 
know the relative positions of the four edges of the full 
box and the groundtruth box. We use kernel density 
estimation to learn a non-parametric distribution of these 
relative positions for each category and generate samples 
for cropping the rendered images. 
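
A minimal sketch of sampling crops from such a distribution 
is shown below. It makes several assumptions that the text 
does not confirm: each edge offset is modelled independently 
with a 1-D Gaussian KDE, offsets are expressed as fractions 
of the full box width or height, and the bandwidth follows a 
normal-reference rule of thumb. All names are hypothetical. 

```python
import random
import statistics


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth 1.06 * sigma * n^(-1/5) for a 1-D Gaussian KDE."""
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-0.2)


def sample_from_kde(samples, rng, bandwidth=None):
    """Draw one value from a Gaussian KDE: pick a data point, then add kernel noise."""
    h = rule_of_thumb_bandwidth(samples) if bandwidth is None else bandwidth
    return rng.gauss(rng.choice(samples), h)


def sample_crop(full_box, edge_offsets, rng):
    """Sample a truncated box from the full object box of a rendered image.

    full_box: (x0, y0, x1, y1).
    edge_offsets: dict with keys left, top, right, bottom; each value is a list of
    relative offsets observed on real images (groundtruth edge minus full-box
    edge, divided by the full-box width or height).
    """
    x0, y0, x1, y1 = full_box
    w, h = x1 - x0, y1 - y0
    return (x0 + sample_from_kde(edge_offsets["left"], rng) * w,
            y0 + sample_from_kde(edge_offsets["top"], rng) * h,
            x1 + sample_from_kde(edge_offsets["right"], rng) * w,
            y1 + sample_from_kde(edge_offsets["bottom"], rng) * h)
```

Sampling from a KDE this way needs no explicit density 
evaluation, which keeps generation cheap when producing 
hundreds of thousands of crops. 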

E. Network Details (Sec 4.2) 

We adapt the network architecture from R-CNN to object 
viewpoint estimation. For notation, conv means a 
convolutional layer including pooling and ReLU, and fc 
means a fully connected layer. The number following conv 
or fc, starting from 1 at the bottom, gives the order of 
the layer. We keep the structure of the convolutional 
layers and of the fc6 and fc7 fully connected layers 
consistent with the R-CNN network. Depending on the 
number of layers fine-tuned (for example, we fine-tune 
the layers above, but not including, conv3), the lower 
layers can be shared between detection and viewpoint 
estimation, which reduces computation cost. The last fully 
connected layer is object category specific. All categories 
share the layers below and including fc7 (once trained with 
viewpoint annotations, the fc7 features preserve geometric 
information about the image). On top of fc7, each category 
has a fully connected layer with 8640 neurons (4320 for 
azimuth + 2160 for elevation + 2160 for in-plane rotation). 
We use the geometric structure aware loss mentioned in the 
main paper as the loss layer. During back propagation, only 
the viewpoint losses from the object category of the 
instance are counted. 
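
The category-specific masking of the loss can be illustrated 
with the sketch below. It is not the network code: a plain 
softmax cross-entropy stands in for the geometric structure 
aware loss, and the data layout (one dictionary of class 
scores per category) and all names are assumptions. 

```python
import math


def softmax_cross_entropy(scores, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[target]


def masked_viewpoint_loss(heads, category, targets):
    """Sum the angle losses of the instance's own category only.

    heads: list indexed by category; heads[c][angle] is a list of class scores.
    category: index of the object category of this training instance.
    targets: dict mapping each angle name to its groundtruth class index.
    Heads of all other categories contribute nothing, so they get no gradient.
    """
    angles = ("azimuth", "elevation", "inplane")
    return sum(softmax_cross_entropy(heads[category][a], targets[a]) for a in angles)
```

Because the heads of other categories never enter the sum, 
their weights stay untouched for this instance while the 
shared layers up to fc7 still receive a gradient. 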


Name           synset offset   num
aeroplane      n02691156       4045
bicycle        n02834778       59
boat           n04530566       1939
bottle         n02876657       498
bus            n02924116       939
car            n02958343       7497
chair          n03001627       6928
dining table   n04379243       3650
motorbike      n03790512       337
sofa           n04256520       3173
train          n04468005       389
tv monitor     n03211117       1095


Table 8. Statistics of models used in the paper. 


F. More Details on 3D Model Set (Sec 5.1) 

Table 8 lists the statistics of the models used in the main 
paper. All models are downloaded from ShapeNet, which 
is organized by the taxonomy of WordNet. Models are also 
pre-aligned to have consistent orientation by ShapeNet. In 
WordNet (and ShapeNet), each category is indexed by a 
unique id named “synset offset”. 

G. More Examples 

See next pages for examples of positive and negative 
results. The negative results are grouped by error patterns. 

Positive Examples 

• Gray bar indicates the viewpoint estimation confidence. The darker, the higher. 

• Red vertical line in the gray bar indicates the groundtruth viewpoint. 


table chair 




Occlusion 


table occluded by chair 






Multiple Objects 








Ambiguity 






Truncation 

































































